March 4, 2025
Images represent a structured input that is difficult for many machine learning methods to handle
Each colored instance is a \(3 \times H \times W\) tensor input
Location matters! Images are all about spatial context - a cat is a cat regardless of which way it is facing!
A lot of “features” per instance - a \(3 \times 32 \times 32\) image has 3072 pixel values!
Localized task:
Within an image, find the object of interest
Classify
Surround with four point bounding box
There can be multiple objects!
Use crops of the original image
Classify each crop
Determine the IoU with the true bounding box
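The IoU step above can be sketched in plain Python. A minimal version for two axis-aligned boxes (the `(x1, y1, x2, y2)` corner convention and the function name are our own choices for illustration):

```python
def iou(box_a, box_b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    # Corners of the intersection rectangle
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    # Clamp to zero when the boxes don't overlap
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# Two 10x10 boxes overlapping in a 5x5 patch: 25 / (100 + 100 - 25)
print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # → 0.142857...
```

A crop whose IoU with the true bounding box exceeds some threshold (commonly 0.5) is treated as a detection of the object.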
Fortunately, we don’t have to fully train a YOLO detector ourselves!
A little different than normal since we’ll use an external package by Ultralytics
There’s an industry for commercial usage of YOLO
Ultralytics is a company that manages and creates wrappers for YOLO models
Open license for individual/research use
Paid for commercial
Some criticism since YOLO was developed open source, but this is just the way it works
Label each pixel in the image with an appropriate category label!
This approach is essentially combining classification with object segmentation!
Super expensive:
Won’t effectively use neighborhood information
Clever solution: the U-Net
This is a different architecture than we’ve seen before!
End-to-end convolutions - no dense layers for classification. Will all be handled by convolution operations.
Start with the high-res image and downsample (strided convolution or pooling) to get a many channel low-res feature map of the original image (same as usual)
At the low-res bottleneck, upsample back to the original image size substituting color vectors with a label in our vocabulary of objects (cow, sky, grass, trees).
Strided Convolution and Max Pooling downsample the original image
Increase the number of channels
Each channel corresponds to some feature of the image
For classification, detect if feature exists
We lose that location information by the end of the convolutional layers!!!
When we downsample:
Reduce the resolution of the original image concentrating on different parts of the image
Produce more, but less crisp, feature maps that correspond to different parts of the image
For classification, we only need to know individual parts!
Does the object have a wing?
Does the object have a bird-head?
For semantic segmentation, we start by downsampling to break the image into parts and determine if it has certain parts:
But, knowing that there is a cow doesn’t tell us where the cow is!
The clever part of U-Net: take the broken down image parts and reconstruct them into a map that corresponds back to original image!
Do this in a way that preserves the information learned about the parts of the image (does the image have cow parts? does the image have cat parts? Sky? Grass?)
And localizes the knowledge back to the original image locations!
This may seem like a fool’s errand, but remember that our ultimate goal is to take a color pixel and translate it to a class value!
\[ [(0,255),(0,255),(0,255)] \rightarrow (0,1,2,...,C) \]
A much smaller set of possible values than the original input
Less detailed than a color value
Recall that convolution with stride 1 returns a smaller matrix:
\[ \underset{(H \times H)}{X} \circledast \underset{(f \times f)}{K} = \underset{(H - f + 1) \times (H - f + 1)}{C} \]
With higher stride, the resulting matrix gets even smaller!
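This shrinkage is easy to verify in plain Python. A minimal "valid" convolution with stride 1 (strictly the cross-correlation that deep learning libraries actually compute; the function name and example values are our own):

```python
def conv2d(x, k):
    """'Valid' convolution, stride 1: an H x H input and f x f filter
    produce an (H - f + 1) x (H - f + 1) output."""
    H, f = len(x), len(k)
    out_size = H - f + 1
    out = [[0] * out_size for _ in range(out_size)]
    for i in range(out_size):
        for j in range(out_size):
            # Elementwise product of the filter with the patch at (i, j)
            for a in range(f):
                for b in range(f):
                    out[i][j] += x[i + a][j + b] * k[a][b]
    return out

x = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
k = [[1, 0], [0, 1]]
result = conv2d(x, k)
print(len(result))  # 4 - 2 + 1 = 3
```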
Is it possible for us to stride by less than 1?
Transposed convolution is the main method of upsampling!
Stride 2 (on the output matrix):
\[ \left[\begin{array}{cc}\color{blue}0 & 1 \\2 & 3 \end{array}\right] \circledast^{-1} \left[\begin{array}{cc}\color{blue}0 & \color{blue}1 \\\color{blue}2 & \color{blue}3 \end{array}\right] \]
\[ \left[\begin{array}{cccc}\color{blue}0 & \color{blue}0 & & \\\color{blue}0 & \color{blue}0 & & \\ & & & \\ & & & \end{array}\right] \]
Stride 2 (on the output matrix):
\[ \left[\begin{array}{cc} 0 & \color{blue}1 \\2 & 3 \end{array}\right] \circledast^{-1} \left[\begin{array}{cc}\color{blue}0 & \color{blue}1 \\\color{blue}2 & \color{blue}3 \end{array}\right] \]
\[ \left[\begin{array}{cccc}0 & 0 & \color{blue}0 & \color{blue}1 \\0 & 0 & \color{blue}2 & \color{blue}3 \\ & & & \\ & & & \end{array}\right] \]
Stride 2 (on the output matrix):
\[ \left[\begin{array}{cc} 0 & 1 \\\color{blue}2 & 3 \end{array}\right] \circledast^{-1} \left[\begin{array}{cc}\color{blue}0 & \color{blue}1 \\\color{blue}2 & \color{blue}3 \end{array}\right] \]
\[ \left[\begin{array}{cccc}0 & 0 & 0 & 1 \\0 & 0 & 2 & 3 \\ \color{blue}0& \color{blue}2 & & \\ \color{blue}4& \color{blue}6& & \end{array}\right] \]
Stride 2 (on the output matrix):
\[ \left[\begin{array}{cc} 0 & 1 \\2 & \color{blue}3 \end{array}\right] \circledast^{-1} \left[\begin{array}{cc}\color{blue}0 & \color{blue}1 \\\color{blue}2 & \color{blue}3 \end{array}\right] \]
\[ \left[\begin{array}{cccc}0 & 0 & 0 & 1 \\0 & 0 & 2 & 3 \\ 0& 2 &\color{blue}0 &\color{blue}3 \\ 4& 6&\color{blue}6 &\color{blue}9 \end{array}\right] \]
Transposed convolution:
Scales the filter elementwise by each input value
Adds each scaled copy of the filter into the corresponding block of the output matrix
Striding is w.r.t. the output matrix not the input matrix
Results in a larger output than input.
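The worked example above (a 2x2 input and 2x2 filter with stride 2) can be reproduced with a minimal, dependency-free sketch; the function name `conv_transpose2d` is our own:

```python
def conv_transpose2d(x, k, stride=2):
    """Transposed convolution: each input element scales the kernel
    and the scaled copy is added into the output at stride-spaced
    offsets (striding is with respect to the output matrix)."""
    H, W = len(x), len(x[0])
    f = len(k)
    out_h = (H - 1) * stride + f
    out_w = (W - 1) * stride + f
    out = [[0] * out_w for _ in range(out_h)]
    for i in range(H):
        for j in range(W):
            for a in range(f):
                for b in range(f):
                    out[i * stride + a][j * stride + b] += x[i][j] * k[a][b]
    return out

x = [[0, 1], [2, 3]]
k = [[0, 1], [2, 3]]
# Reproduces the four colored blocks from the worked example
print(conv_transpose2d(x, k))
```

The 2x2 input becomes a 4x4 output, matching the final matrix built up step by step above.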
We can think of this method as a learnable linear interpolation method
Start with the low res image
Make it bigger by finding the appropriate values linearly between the associated input pixels
However, the weights for the interpolation are learned via the transposed convolution filters!
Clever solution: the U-Net
The idea of the U-Net:
Encode the original image in a feature space that tells us what parts are present
Upsample in a clever way that allows us to decode the encoded space into pixel-wise class space (0,1,2,…,C).
This is our first example of a deep encoder-decoder architecture!
A common method of taking a complex input and putting it into a space that can be altered to get a complex output!
Think PCA but for images or text!
An encoder
A hidden state
A decoder
Example: Spanish \(\rightarrow\) Meaning \(\rightarrow\) English
Example: RGB Image \(\rightarrow\) Essence \(\rightarrow\) Pixel Map!
The base U-Net Architecture:
Structure:
Start with the input image and pass it through a few convolutional layers
Max pool to reduce size
Repeat, increasing the number of convolutional filters
At the low-res bottleneck, switch max pooling to transposed convolution
Pass it through a few convolutional layers
Keep passing it through transposed convolutions and convolutional layers until we get back to the original image size!
The final layer then has probabilities that each pixel belongs to a semantic class!
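The shape bookkeeping in the structure above can be traced with a toy calculator. This is purely illustrative (a real U-Net also applies convolutions at every level, which we ignore here; all names and numbers are our own):

```python
def unet_shapes(size, channels, depth):
    """Trace (channels, H, W) through a toy U-Net: each encoder level
    doubles the channels and halves the resolution; each decoder level
    reverses this with transposed-convolution upsampling."""
    shapes = [(channels, size, size)]
    c, s = channels, size
    for _ in range(depth):   # encoder: downsample, add filters
        c, s = c * 2, s // 2
        shapes.append((c, s, s))
    for _ in range(depth):   # decoder: upsample back to input size
        c, s = c // 2, s * 2
        shapes.append((c, s, s))
    return shapes

# 64x64 input with 16 feature maps, 3 levels down and 3 levels up
print(unet_shapes(64, 16, 3))
```

The trace bottoms out at a low-resolution, many-channel feature map and returns to the original spatial size, the "U" shape that gives the architecture its name.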
The problem: Without further supervision, we end up with good classification, but relatively poor localization.
This is where the skip connections come in
For each upsampling layer, concatenate the set of feature maps with the downsampling layer feature maps that have corresponding size
The bottleneck loses a lot of information about where the objects are located in the images
The layers above have a lot of that info!
Pair each upsampling layer with the corresponding downsampling layer to share information
U-Nets represent a state-of-the-art method for semantic segmentation!
Loss functions:
Pixel wise cross entropy loss (each pixel belongs to a class)
Intersection over Union between the predicted and true pixel masks, usually excluding the background class:
\[ \frac{\text{True Positives}}{\text{True Positives + False Positives + False Negatives}} \]
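The pixel-wise cross-entropy can be sketched in a few lines, assuming `probs[i][j]` holds a softmax distribution over classes for pixel (i, j) (the function name and the tiny 2x2 example are our own):

```python
import math

def pixelwise_cross_entropy(probs, labels):
    """Mean negative log-probability of the true class at each pixel.
    probs[i][j] is a list of class probabilities for pixel (i, j);
    labels[i][j] is the true class index for that pixel."""
    total, n = 0.0, 0
    for prob_row, label_row in zip(probs, labels):
        for p, y in zip(prob_row, label_row):
            total += -math.log(p[y])
            n += 1
    return total / n

# A 2x2 "image" with two classes per pixel
probs = [[[0.9, 0.1], [0.2, 0.8]],
         [[0.7, 0.3], [0.4, 0.6]]]
labels = [[0, 1], [0, 1]]
print(pixelwise_cross_entropy(probs, labels))
```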
Goal: Detect all objects in the image and identify the pixels that belong to each object
Approach: Perform object detection, then predict a segmentation mask for each object.
Intro method is called Mask R-CNN
New hotness is the Segment Anything Model
The Segment Anything Model deals with the fact that segmentation models require a lot of hard to label data
All images used to train a model must be associated with a segmentation mask for all possible pixel classes that will be predicted
Pre-trained segmenters are kind of hard to do
Require a lot of tagged data
Can you think of any companies that might have access to a lot of photos where users voluntarily tag parts of the image and provide lots of examples of images?
Meta’s SAM is a new approach to segmentation that uses a very large data set of 11 million images with 1.1 billion masks to train an image encoder and decoder that does a good job of finding object “blobs” regardless of class label
Just learns to find blobs like a CNN and translate them back out as their own class
Whatever that may be
SAM can also be prompted to find masks for specific things.
For example - given a point, find the object that includes that point.
For any point, returns a series of “likely masks”
Can be used to determine whole vs. parts
Freely available model that can be used!
A final discussion for today:
When we train a classifier for an image, \(\mathbf x\), we learn a discriminative distribution
\[ P(y = c | \mathbf x) \]
As we’ve seen, these classifiers can be really good!
Given a value of \(\mathbf x\), we can do a really good job of determining whether or not it includes a bird
What if we wanted to reverse this conditional?
Instead of learning the class from an image, what if we wanted to learn the image from a class label
\[ P(\mathbf x | y = c) \]
We have Bayes Theorem
\[ P(\mathbf x | y = c) = \frac{1}{Z} P(y = c | \mathbf x)P(\mathbf x) \]
The CNN classifier gives us the first part (the likelihood)
The second part will be an image prior
The tricky bit: it’s not easy to assess the probability of an image, and how do we know which direction to move in?
The goal:
Starting with a random image, work our way towards the one that maximizes the posterior probability that we would see \(\mathbf x\) given banana!
We can think of this as a maximum a posteriori (MAP) problem where the log posterior (up to a constant) is:
\[ \log P(y = c | \mathbf x) + \log P(\mathbf x) \]
The unadjusted Langevin algorithm is a variation on the Metropolis-Hastings algorithm that samples from posterior distributions while taking into account local gradient information
MH sorta randomly walks around the space
Metropolis-adjusted Langevin moves around the space in directions that are likely to increase the posterior probability but rejects some moves
The unadjusted variant rejects no moves! It just always moves, for better or for worse
Nguyen et al. (2017) showed that we could propose a reasonable sequence of steps from any starting image to the posterior maximizer using the following update algorithm:
\[ \mathbf x_{t + 1} = \mathbf x_t + \epsilon_1 \frac{\partial \log P(\mathbf x_t)}{\partial \mathbf x_t} + \epsilon_2 \frac{\partial \log P(y = c | \mathbf x_t)}{\partial \mathbf x_t} \]
One of these gradients, \(\partial \log P(y = c | \mathbf x_t) / \partial \mathbf x_t\), is a by-product of the training procedure
Super fast autodiff methods give it to us with relatively low computational cost!
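The update rule can be sketched on a toy 1-D problem. Everything here is an assumption for illustration: we take \(\log P(x) = -x^2/2\) as a stand-in prior, \(\log P(y=c|x) = -(x-3)^2/2\) as a stand-in classifier, and we omit the Gaussian noise term that a full Langevin sampler would inject at each step:

```python
def grad_log_prior(x):
    # Assumed toy prior: log P(x) = -x**2 / 2, so the gradient is -x
    return -x

def grad_log_likelihood(x):
    # Assumed toy "classifier": log P(y=c|x) = -(x - 3)**2 / 2
    return -(x - 3)

# Unadjusted updates: always move, never reject
x, eps1, eps2 = 0.0, 0.1, 0.1
for _ in range(500):
    x = x + eps1 * grad_log_prior(x) + eps2 * grad_log_likelihood(x)
print(x)  # settles near 1.5, balancing prior and likelihood
```

The iterates converge to the point where the two gradients cancel, exactly the balance between the prior and the classifier that the image version exploits.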
The last part is just coming up with an image prior
Priors should bake in our prior beliefs about what an image should look like
Any general rules we should follow when considering how likely a pixel is given the other pixels?
Think about neighbors
A smart differentiable prior is called the total variation prior. For a pixel \(x_{i,j,k}\), we look at its neighbors in the same color channel and penalize its squared differences from them:
\[ (x_{i,j,k} - x_{i+1,j,k})^2 + (x_{i,j,k} - x_{i,j+1,k})^2 \]
The total variation prior is then:
\[ TV(\mathbf x) = \sum_{i,j,k} (x_{i,j,k} - x_{i+1,j,k})^2 + (x_{i,j,k} - x_{i,j+1,k})^2 \]
The log of the prior, proportional to \(-TV(\mathbf x)\), is then maximized when all pixel values are equal!
But the log of the likelihood is going to be maximized when the picture looks most banana-like, even if things away from the banana are incoherent
The sum of these two elements will then strike a balance between smoothness and “banana-ness”
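The \(TV(\mathbf x)\) sum can be sketched in plain Python (nested lists stand in for an image tensor indexed `[channel][row][col]`; the function name is our own):

```python
def total_variation(x):
    """Total variation of an image given as nested lists
    [channel][row][col]: sums the squared differences between each
    pixel and its right and lower neighbours in the same channel."""
    tv = 0.0
    for chan in x:
        H, W = len(chan), len(chan[0])
        for i in range(H):
            for j in range(W):
                if i + 1 < H:                          # lower neighbour
                    tv += (chan[i][j] - chan[i + 1][j]) ** 2
                if j + 1 < W:                          # right neighbour
                    tv += (chan[i][j] - chan[i][j + 1]) ** 2
    return tv

flat = [[[5, 5], [5, 5]]]   # perfectly smooth image: TV = 0
edge = [[[0, 1], [0, 1]]]   # vertical edge: TV = 2
print(total_variation(flat), total_variation(edge))
```

Smooth images score low, so subtracting a multiple of TV from the log likelihood pushes the optimizer toward coherent pictures.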
This process is computationally intense
Assess the gradient for each proposal image
Takes a lot of moves to go from random to banana
Too intense for my workstation!
This intensity led to other image generation architectures that are more prevalent today
Image transformers, diffusion, GANs, Variational Autoencoders
This method has seen some usage though!
Example 1: Deep Dream
Using the TV prior approach, generate images that overstate certain aspects of the training data
Note that all of our examples have been on dogs. This is a common ML thing. What if we used these dog pictures to try to generate new “art”?
A picture of dogs playing poker generated from dogs in the ImageNet data set
Gary Busey generated from a bunch of pictures of Gary Busey
“Nevermind” but Wildlife
Example 2: Neural Style Transfer
Given an image with content and an image with a certain style, find an image that is close to the content in the style of the other one!
This is actually a pretty easy method to understand - we just don’t have time to cover it in depth in this class.
See PML 14.6.5 for the math of how this works!
We just scratched the surface of what CNNs can do for us!
Classify images
Bounding boxes
Pixel labelling
Generation
The list goes on and on.
We’ll come back to CNNs when we talk about generative models in a couple of weeks!
Next time, we’ll start discussing sequence models
One class (max) on RNNs
Attention
Transformers
GPT and Autoregressive Generation/Pixel RNNs